ReneWind

Renewable energy sources play an increasingly important role in the global energy mix as efforts to reduce the environmental impact of energy production intensify.

Among renewable energy alternatives, wind energy is one of the most developed technologies worldwide. The U.S. Department of Energy has put together a guide to achieving operational efficiency using predictive maintenance practices.

Predictive maintenance uses sensor information and analysis methods to measure and predict degradation and future component capability. The idea behind predictive maintenance is that failure patterns are predictable: if component failure can be predicted accurately and the component is replaced before it fails, the costs of operation and maintenance will be much lower.

The sensors fitted across different machines involved in the process of energy generation collect data related to various environmental factors (temperature, humidity, wind speed, etc.) and additional features related to various parts of the wind turbine (gearbox, tower, blades, brake, etc.).

Objective

“ReneWind” is a company working on improving the machinery/processes involved in the production of wind energy using machine learning, and has collected sensor data on generator failures of wind turbines. They have shared a ciphered version of the data, as the data collected through sensors is confidential (the type of data collected varies between companies). The data has 40 predictors, with 40,000 observations in the training set and 10,000 in the test set.

The objective is to build and tune various classification models and find the best one for identifying failures, so that generators can be repaired before they break and the overall maintenance cost of the generators can be brought down.

“1” in the target variable should be considered as “failure” and “0” represents “no failure”.

The predictions made by the classification model translate into costs as follows:

  1. True positives (TP) - failures predicted correctly, incurring a repair cost
  2. False negatives (FN) - failures the model misses, incurring a replacement cost
  3. False positives (FP) - false alarms, incurring an inspection cost

So, the maintenance cost associated with the model would be:

Maintenance cost = TP*(Repair cost) + FN*(Replacement cost) + FP*(Inspection cost)

Since the objective is to reduce the maintenance cost, we want an evaluation metric that reflects that cost directly.

So, we will try to maximize the ratio of the minimum possible maintenance cost to the maintenance cost associated with the model:

Cost ratio = (Minimum possible maintenance cost) / (Maintenance cost of the model)

The value of this ratio lies between 0 and 1; it equals 1 only when the maintenance cost of the model equals the minimum possible maintenance cost, i.e. every actual failure is caught with no false alarms.
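As a minimal sketch, the cost ratio above can be computed from a confusion matrix. The per-event cost figures below are illustrative assumptions (the brief does not provide the actual values), chosen only to satisfy the usual ordering replacement > repair > inspection:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Assumed, illustrative costs -- not provided in the brief.
REPAIR_COST = 15_000       # per true positive (failure caught in time)
REPLACEMENT_COST = 40_000  # per false negative (missed failure)
INSPECTION_COST = 5_000    # per false positive (false alarm)

def maintenance_cost(y_true, y_pred):
    """Total maintenance cost implied by a set of predictions."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return tp * REPAIR_COST + fn * REPLACEMENT_COST + fp * INSPECTION_COST

def cost_ratio(y_true, y_pred):
    """Minimum possible cost divided by the model's cost (1.0 is ideal)."""
    # Minimum possible cost: every actual failure is caught and repaired,
    # with no false alarms.
    min_cost = int(np.sum(y_true)) * REPAIR_COST
    return min_cost / maintenance_cost(y_true, y_pred)
```

With perfect predictions the ratio is exactly 1; any missed failure or false alarm pushes it below 1.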

Data Description

Importing libraries

Loading Data

Data Overview

EDA and insights

Univariate Analysis

Bivariate Analysis


Data Pre-processing

Model evaluation criterion

Three types of cost are associated with this problem:

  1. Replacement cost - False Negatives - predicting no failure when there will be a failure
  2. Inspection cost - False Positives - predicting a failure when there is none
  3. Repair cost - True Positives - predicting a failure correctly

How to reduce the overall cost?

Let's create two functions to calculate the different metrics and the confusion matrix, so that we don't have to repeat the same code for each model.
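A sketch of what such helpers could look like (the function names here are illustrative, not the notebook's actual ones): one returns the standard classification metrics as a one-row DataFrame, the other a labelled confusion matrix.

```python
import pandas as pd
from sklearn.metrics import (accuracy_score, recall_score,
                             precision_score, f1_score, confusion_matrix)

def model_performance_classification(model, X, y):
    """Return accuracy, recall, precision and F1 for a fitted model."""
    pred = model.predict(X)
    return pd.DataFrame({
        "Accuracy": [accuracy_score(y, pred)],
        "Recall": [recall_score(y, pred)],
        "Precision": [precision_score(y, pred)],
        "F1": [f1_score(y, pred)],
    })

def confusion_matrix_table(model, X, y):
    """Confusion matrix as a labelled DataFrame (rows = actual class)."""
    cm = confusion_matrix(y, model.predict(X))
    return pd.DataFrame(
        cm,
        index=["Actual: No failure", "Actual: Failure"],
        columns=["Predicted: No failure", "Predicted: Failure"],
    )
```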

Defining scorer to be used for hyperparameter tuning
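One reasonable choice of tuning metric, sketched here as an assumption, is recall: false negatives (missed failures) carry the costly replacement, so a scorer that rewards catching failures aligns with the cost objective.

```python
from sklearn.metrics import make_scorer, recall_score

# Recall penalizes false negatives, the most expensive error type here,
# so we wrap it as the scorer passed to the hyperparameter search.
scorer = make_scorer(recall_score)
```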

Model Building with Original data

Logistic Regression

Let's evaluate the model performance using KFold and cross_val_score.
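A minimal sketch of this evaluation, using a synthetic imbalanced stand-in for the confidential sensor data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in: 40 predictors, ~10% failures, like the real data's shape.
X, y = make_classification(n_samples=500, n_features=40,
                           weights=[0.9, 0.1], random_state=1)

model = LogisticRegression(max_iter=1000, random_state=1)
cv = KFold(n_splits=5, shuffle=True, random_state=1)

# One recall score per fold; their mean summarizes cross-validated performance.
scores = cross_val_score(model, X, y, cv=cv, scoring="recall")
print(scores.mean())
```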

Decision Tree

Bagging Classifier

Random Forest Classifier

Boosting

AdaBoost Classifier

Gradient Boosting Classifier

XGBoost Classifier

Model Building with Oversampled data
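As a sketch of the oversampling step: the snippet below uses simple random oversampling with scikit-learn's `resample` on synthetic stand-in data. (The notebook may instead use SMOTE from imbalanced-learn, which interpolates synthetic minority points rather than duplicating rows.)

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import resample

X, y = make_classification(n_samples=500, n_features=40,
                           weights=[0.9, 0.1], random_state=1)

# Random oversampling: duplicate minority-class ("failure") rows
# until both classes are the same size.
X_min, X_maj = X[y == 1], X[y == 0]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=1)
X_over = np.vstack([X_maj, X_min_up])
y_over = np.hstack([np.zeros(len(X_maj)), np.ones(len(X_min_up))])
```

Oversampling is applied to the training set only; the test set keeps its original class balance.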

Logistic Regression - Oversampled

Decision Tree - Oversampled

Bagging Classifier - Oversampled

Random Forest Classifier - Oversampled

Boosting

AdaBoost Classifier - Oversampled

Gradient Boosting Classifier - Oversampled

XGBoost Classifier - Oversampled

Model Building with Undersampled data
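The mirror image of oversampling, sketched here with random undersampling on synthetic stand-in data (imbalanced-learn's RandomUnderSampler would be the library equivalent):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import resample

X, y = make_classification(n_samples=500, n_features=40,
                           weights=[0.9, 0.1], random_state=1)

# Random undersampling: keep all minority ("failure") rows and draw an
# equally sized sample, without replacement, from the majority class.
X_min, X_maj = X[y == 1], X[y == 0]
X_maj_down = resample(X_maj, replace=False, n_samples=len(X_min), random_state=1)
X_under = np.vstack([X_maj_down, X_min])
y_under = np.hstack([np.zeros(len(X_maj_down)), np.ones(len(X_min))])
```

Undersampling trades information loss (discarded majority rows) for a balanced, faster-to-train set; again it is applied only to the training data.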

Logistic Regression - Undersampled

Bagging Classifier - Undersampled

Random Forest Classifier - Undersampled

Boosting

AdaBoost Classifier - Undersampled

Gradient Boosting Classifier - Undersampled

XGBoost Classifier - Undersampled

Model Selection

Hyperparameter Tuning

For XGBoost:

param_grid = {
    "n_estimators": np.arange(150, 300, 50),
    "scale_pos_weight": [5, 10],
    "learning_rate": [0.1, 0.2],
    "gamma": [0, 3, 5],
    "subsample": [0.8, 0.9],
}

For Gradient Boosting:

param_grid = {
    "init": [AdaBoostClassifier(random_state=1), DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(75, 150, 25),
    "learning_rate": [0.2, 0.05, 1],
    "subsample": [0.5, 0.7],
    "max_features": [0.5, 0.7],
}

For Adaboost:

param_grid = {
    "n_estimators": np.arange(10, 110, 20),
    "learning_rate": [0.2, 0.05, 1],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}

For Logistic Regression:

param_grid = {'C': np.arange(0.1,1.1,0.1)}

For Bagging Classifier:

param_grid = {
    "max_samples": [0.8, 0.9],
    "max_features": [0.8, 0.9],
    "n_estimators": [40, 50],
}

For Random Forest:

param_grid = {
    "n_estimators": [150, 250],
    "min_samples_leaf": np.arange(1, 3),
    "max_features": ["sqrt", "log2"],
    "max_samples": np.arange(0.2, 0.6, 0.1),
}

For Decision Trees:

param_grid = {
    "max_depth": np.arange(2, 20),
    "min_samples_leaf": [1, 2, 5, 7],
    "max_leaf_nodes": [5, 10, 15],
    "min_impurity_decrease": [0.0001, 0.001],
}
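As a sketch of how any one of these grids is searched, here is the decision-tree grid wired into RandomizedSearchCV on synthetic stand-in data; `scoring="recall"` is an assumption standing in for the notebook's custom scorer:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=40,
                           weights=[0.9, 0.1], random_state=1)

param_grid = {
    "max_depth": np.arange(2, 20),
    "min_samples_leaf": [1, 2, 5, 7],
    "max_leaf_nodes": [5, 10, 15],
    "min_impurity_decrease": [0.0001, 0.001],
}

# Sample 10 random combinations from the grid instead of trying all of them.
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_distributions=param_grid,
    n_iter=10, scoring="recall", cv=3, random_state=1,
)
search.fit(X, y)
print(search.best_params_)
```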

XGB - Hyperparameter tuning

Random Forest - Model Tuning

Gradient Boosting - Model Tuning

Model Performance comparison and choosing the final model

Test set final performance

Let us explore the feature importances based on this model.
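A minimal sketch of that exploration, assuming a tree-ensemble final model and the generic feature names V1–V40 (the real predictors are ciphered, so these names are illustrative):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=40, random_state=1)
model = RandomForestClassifier(random_state=1).fit(X, y)

# Impurity-based importances, sorted so the most influential sensors come first.
importances = (pd.Series(model.feature_importances_,
                         index=[f"V{i + 1}" for i in range(X.shape[1])])
               .sort_values(ascending=False))
print(importances.head())
```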

Pipelines to build the final model
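A sketch of such a pipeline, with assumed preprocessing steps (imputation and scaling) feeding the final classifier; bundling them ensures the transforms fitted on training data are applied identically at prediction time:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=40, random_state=1)

# Imputation -> scaling -> model, fitted and serialized as one object.
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(random_state=1)),
])
pipe.fit(X, y)
```

The fitted pipeline can then be pickled and deployed as a single artifact, so new sensor readings go through exactly the same preprocessing as the training data.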

Business Insights and Conclusions